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Mixtures of Gaussian (or normal) distributions arise in a variety 
of application areas. Many heuristics have been proposed for the task 
of finding the component Gaussians given samples from the mixture, 
such as the EM algorithm , a local-search heuristic from Dempster, 

Laird and Rubin [J. Roy. Statist. Soc. Ser. B 39 (1977) 1-38]. These 
do not provably run in polynomial time. 

We present the first algorithm that provably learns the component 
Gaussians in time that is polynomial in the dimension. The Gaussians 
may have arbitrary shape, but they must satisfy a “separation condi¬ 
tion” which places a lower bound on the distance between the centers 
of any two component Gaussians. The mathematical results at the 
heart of our proof are “distance concentration” results—proved using 
isoperimetric inequalities—which establish bounds on the probabil¬ 
ity distribution of the distance between a pair of points generated 
according to the mixture. 

We also formalize the more general problem of max-likelihood fit 
of a Gaussian mixture to unstructured data. 

1. Introduction. Finite mixture models are ubiquitous in a host of areas 
that use statistical techniques, including artificial intelligence (AI), com¬ 
puter vision, medical imaging, psychology and geology (see [15, 23]). A mix¬ 
ture of distributions T>i 1 T> 2 ,... with mixing weights uq, W 2 , W 3 ,... (where 
Wi = 1) is the distribution in which a sample is produced by first picking 
a component distribution—the ith one is picked with probability Wi —and 
then producing a sample from that distribution. In many applications the 
component distributions are multivariate Gaussians. 
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Given samples from the mixture distribution, how can one learn (i.e., re¬ 
construct) the component distributions and their mixing weights? The most 
popular method is probably the EM algorithm of Dempster, Laird and Ru¬ 
bin [7]. EM is a local search heuristic that tries to converge to a maximum- 
likelihood, description of the data by trying to cluster sample points according 
to which Gaussian they came from. Though moderately successful in prac¬ 
tice, it often fails to converge or gets stuck in local optima. Much research 
has gone into fixing these problems, but has not yet resulted in an algorithm 
that provably runs in polynomial time. A second known technique is called 
projection pursuit in statistics [12]. In this, one projects the samples into a 
random low-dimensional space and then, in the projected space, tries to do 
the clustering (exhaustively) exploiting the low dimensionality. 

We note that some combinatorial problems seemingly related to learning 
Gaussian mixtures are NP-hard. For instance, Megiddo [18] shows that it 
is NP-hard to decide, given a set of points in 3^ n , whether the points can 
be covered by two unit spheres. This problem seems related to learning a 
mixture of two spherical Gaussians. 

Nevertheless, one may hope that when the data is generated from the 
mixture of Gaussians (as opposed to being unstructured as in Megiddo’s 
result) then the algorithm could use this structure in the data. Recently, 
Dasgupta [5] took an important step in this direction by showing how a 
mixture of k identical Gaussians could be learned in polynomial time pro¬ 
vided the Gaussians are “spherelike” (their probability mass is concentrated 
in a thin spherical shell) and their centers are “well-separated.” (Such sepa¬ 
ration conditions correspond to a nondegeneracy assumption: if the mixture 
contains two identical Gaussians whose centers are arbitrarily close, then 
they cannot be distinguished even in principle.) 

Though Dasgupta’s algorithm is a good first step, it leaves open the ques¬ 
tion whether one can design algorithms that require weaker assumptions on 
the Gaussians. This is the topic of the current paper: our algorithms make 
no assumption about the shape of the Gaussians but they require the Gaus¬ 
sians to be “well-separated.” Even for the special case of spherical Gaussians, 
our result improves Dasgupta’s (and a result of Dasgupta and Schulman [6] 
that is independent of our work). We describe our results in more detail in 
Section 2.3 and compare them to other work. 

We also define a more general problem of Gaussian fitting, whereby we 
make no assumptions about the data and have to fit the mixture of k Gaus¬ 
sians that maximizes the log-likelihood it assigns to the dataset (see Sec¬ 
tion 2.1). We use techniques developed in the context of approximation 
algorithms to design algorithms for one of the problems (see Section 4). The 
exact problem is NP-hard. 
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2. Definitions and overview. The univariate distribution N(p,,a) on ’ft 
has the density function f{x) = (-v/Ercr) -1 exp(— )• It satisfies E[{x — 
g) 2 ] = a 2 . The analogous distribution in 3ft n is the axis-aligned, Gaussian 
N(p,,a), where p,,a € 3ft n and the density function is the product distribu¬ 
tion of a\), < 72 ), ■ ■ ■, N(fi n , a n )• A random sample (x±,X 2 , ■ ■ ■, x n ) 

satisfies E[£i(xi - Mi) 2 ] = E t of- (Similarly, E[£i(xi - Mi) 2 /of ] = n.) 

A general Gaussian in 3ft n is obtained from an axis-aligned Gaussian by 
applying an arbitrary rotation. Specifically, its probability density function 
has the form 

(1) FqAx) = n. smq) exp( ~ (x ~ pfQ - l(x ~ p)/2) ’ 

where Q is an n x n positive definite matrix with eigenvalues Ai (Q),..., A n (Q) 
0, and p G 3ft n is the center. Since Q can be rewritten as R ~ 1 x diag(Aj(Q))i?, 
where R is a rotation, the quantities A i(Q) play the same role as the variances 
a 2 in the axis-aligned case. From our earlier discussion, E[(x — p) T x (x — p)] 
is EiE(Q) and E[(x - p) T Q~ 2 {x - p)} = f x F QtP (x)(x - p) T Q~ l (x - p) = n. 

For any finite sample of points in 3ft n we can try to fit a Gaussian by 
computing their variance-covariance matrix. Let x\,X 2 , ■ ■ ■ ,xjy be N points 
in 3ft n in general position (i.e., we assume that they do not all lie on a 
hyperplane). Let X be the n x N matrix whose columns are the vectors 
x\ — q,X 2 — q, ■ ■ ■ ,xn — q, where q = + X 2 + ■ • • + xjsr) is the sample 

mean. Then the variance-covariance matrix A = jjXX T ; note that it is 
positive definite by definition. 

This fit may, of course, be poor for an arbitrary point set. However, for 
every e > 0, there is a constant c £ > 0 such that if N > c e n log n and the 
N points are independent, identically distributed samples from a Gaussian 
Fg < p , then with probability at least 0.99, Fa# is a (1 + e)-fit to Fg, p in every 
direction [3, 22] in the sense that |q — p\ 2 < eYjiK{Q) an d |Gh>|(l — e) < 
\Av\ < (1 + e)|Gu| for every unit length vector v. (The proof of this is highly 
nontrivial; but a weaker statement, when the hypothesis is strengthened to 
N > c £ n 2 , is easier to prove.) 

Spherical and spherelike Gaussians. In an axis-aligned Gaussian with 
center at the origin and with variances a 2 ,... ,a 2 , the quantity J2i is 

the sum of n independent identical random variables from IV(0,1) so this 
sum is tightly concentrated about its mean re. In a spherical Gaussian, all 
cij’s are the same, so even x f is tightly concentrated. (These observations 
go back to Borel.) More generally, E[J2iX 2 \ = Ei°f- H the Cj’s are not 
“too different,” then distance-concentration results (similar to Lemma 5 
below) show that that almost all of the probability mass is concentrated 
in a thin spherical shell of radius about (E»such Gaussians may be 
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thought of as spherelike. Roughly speaking, if radius/<7 max = G(logn), then 
the Gaussian is spherelike. Known algorithms (such as [5]) work only for 
such spherelike Gaussians. By contrast, here, we wish to allow Gaussians of 
all shapes. 

2.1. Max-likelihood learning. Now we formalize the learning problems. 

Consider a mixture of Gaussians (w\,F\,W 2 ,F 2 ,. ■ ■ ,w m ,F m ) in K n , where 
the wfis are the mixing weights. With any point x £ §t n , one can associate 
m numbers corresponding to the probabilities assigned to 

it by the various Gaussians according to (1). For any sample S C this 
imposes a natural partition into m blocks: each point x £ S is labeled with an 
integer fix) £ {1,... ,m} indicating the distribution that assigns the highest 
probability to x. (Ties are broken arbitrarily.) The likelihood of the sample 
is 

II F l(x)(x)- 

xGS 

It is customary to work with the logarithm of this likelihood, called the 
log-likelihood. 

Thus one may mean several things when talking about “learning mixtures 
of Gaussians” [21]. The following is the most general notion. 

Definition 1 (Max-likelihood fit). In the max-likelihood fit problem, 
we are given an arbitrary sample S C and a number k\ we desire the 
Gaussian mixture with k components that maximizes the likelihood of S. 

2.2. Classification problem. Now we define the subcase of the learning 
problem when the data is assumed to arise from an unknown mixture of k 
Gaussians, where k is known. 

Definition 2 (Classification problem). In the classification problem, we 
are given an integer k, a real number 5 > 0 and a sample S generated from 
an unknown mixture F\, F 2 ,..., T\. of k Gaussians in where the mixing 
weights W\,W 2 , ■ ■ ■ ,Wk are also unknown. The goal is to find the “correct” 
labeling for each point in S (up to permutation), namely to partition S into 
k subsets such that with probability at least l — d, the partition of S is 
exactly into the subsets of samples drawn according to each F t . 

Viewing the unknown mixture as a “source” we may view this as the 
“source learning” problem. Note that once we know the partition, we can 
immediately obtain estimates of the unknown Gaussians and their mixing 
weights. 
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So, the classification problem has a stronger hypothesis than the maximum- 
likelihood problem in that it assumes that the data came from a mixture. 

It also then requires the result to satisfy the stronger requirement that it is 
exactly the partition into the actual S±, S 2 , ■ ■ ■, Sk, where Si was generated 
according to the Gaussian Fj. (We abuse notation here slightly; we can only 
know the real Si up to a permutation of their indices. However, to avoid 
extra notation, we will say the partition has to be Si, S 2 , ..., Sfc.) 

2.3. Our results. Our main result concerns the classification problem. 
Clearly, the problem has no unique solution if the Gaussians in the mixture 
are allowed to be arbitrarily close to each other. We will assume a certain 
separation between the centers of the Gaussians. The required separation is 
an important consideration and will be motivated in detail in Section 3.2. 
Here we will just state it and mention two important features of it. 

Notation. First, we introduce some notation which we will use through¬ 
out. We let p\,p 2 , ■■■,pk denote the (unknown) centers, respectively, of the 
k Gaussians F\, F 2 ,.. ., F*. comprising the mixture; the maximum variance 
of Fi in any direction will be denoted oy max . We denote by Ri the “median 
radius” of Fj; Fj has the property that the Fj-measure of a ball of radius Ri 
around pi is exactly 1/2. Henceforth, we will reserve the word “radius” of a 
Gaussian to mean its median-radius. 

Here is our formal definition of separation. 

Definition 3. For any t > 0, we say that the mixture is t-separated if 

/"2 \ I Pi Pj\ — | Ri — Rj | T 500 t(Ri + i?j)(oy max + cry max ) 

+ 100t 2 (a 2 max + cr 2 ma J Vi, j. 

We point out here quickly two features of this definition. First, if two 
Gaussians Fi,Fj are both spherical of the same radii (Ri = Rj), then the 
required separation is O* (Fj/n 1//4 ). Second, if Fj, Fj are still spherical, but if 
Rj > Ri, the term — |F 2 — Rj | is negative and makes the separation required 
less. Indeed if Rj = (1 + fl*(l/- v /n))Fj, then the two Gaussians Fj,Fj are 
allowed be to concentric! The superscript * on 0,0 indicates that we have 
omitted factors logarithmic in n. 

Theorem 1 . There is a polynomial-time algorithm for the classification 
problem. The algorithm needs to know a lower bound tc m i n on the mixing 
weights, and the number s of sample points required is 0(n 2 k 2 log (kn 2 ) / (S 2 w ^ n 
The Gaussians may have arbitrary shape but have to be t-separated, where 
t = 0( ^). 
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We also present an approximation algorithm for a special case of the 
max-likelihood fit problem. 

Theorem 2. There is a polynomial-time approximation algorithm for 
the max-likelihood, fit problem in iR n when the Gaussians to be fitted to the 
data have to be spherical of equal radii (the radius and the centers of the k 
Gaussians have to be determined by the algorithm). There is a fixed constant 
c such that the algorithm produces a solution whose log-likelihood is at least 
the best possible minus c. 

The algorithm of Theorem 2 is combinatorial and appears in Section 4. 
We note even this subcase of the maximum-likelihood fit problem is at least 
as hard as the clustering problem /c-median (sum-of-squares version with 
Steiner nodes), which is NP-hard [8]. Indeed, our algorithm is obtained by 
reducing to the ^-median algorithm of [4] (recent more efficient fc-median 
algorithms would also work). We feel that this way of viewing the learning 
problem as an approximation problem may be fruitful in other contexts. 

2.4. Comparison with other results. The algorithm in [5] makes the fol¬ 
lowing assumptions: (i) all the Gaussians have the same variance-covariance 
matrix X; (ii) the maximum and minimum eigenvalues cr^ iax and cr aiin , re¬ 
spectively, of X satisfy £ 0(^/n/\ogk)\ (iii) the centers of any two of 
the k Gaussians are at distance (at least) kl(^/na ma , x ) apart. 

Dasgupta and Schulman [6] showed that the EM algorithm learns (and in¬ 
deed does so in just two rounds) a mixture of spherical Gaussians Tj, T 2 , ■ ■ •, Tfc, 
where T) has radius Ri (the Ri may be different). They require now only 
a separation between centers of Fi,Fj of kl((Ri + i?j)/n 1//4 ). (This amount 
of separation ensures among other things that the densities are “nonover¬ 
lapping”; i.e., there are k disjoint balls, each containing the samples picked 
according to one T).) 

As mentioned, our result is stronger in two ways. First, we allow Gaus¬ 
sians of arbitrary (and different) variance-covariance matrices and, second, 
we allow densities to overlap, or even be concentric. More specifically, the 
term — \Rf — Rj j (which is nonpositive) can make the minimum required 
separation negative (and so a vacuous requirement) in some cases; it allows 
the centers to be close (or even coincide) if the radii are very different. This 
allows a “large feature” to have an identifiable smaller “feature” buried in¬ 
side. For the case dealt with by [6], their requirement is the same as ours 
(since in this case Rj ~ -y/noymax) but for this term and logarithmic factors 
and thus their result essentially follows as a special case of ours. For the case 
dealt with by [5], our requirement is again weaker than that paper’s but for 
logarithmic factors [since y^^Tmax £ O(Ri)]. After the first appearance of 
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our paper [2], Vempala and Wong [25] improved the separation requirement 
to essentially the optimal one for the special case of spherical Gaussians: 

|pi — pj | = f l((Ri + Rj)/y/n). Their spectral technique is entirely different 
from ours. 

3. Algorithm for classification problem. First we fix notation for the 
rest of the section. We are given a set S of samples, picked according to 

an unknown mixture w\F\ + w^Fi + -H w^F^ of Gaussians Fi, F2, ..., F^. 

The known quantities are k and a number rc m ; n that is a lower bound on 
the iCj’s. We have to decompose S as S = S± U S 2 U • • • U Sk, where Si are 
the samples that were generated using Fi. 

Section 3.1 describes the algorithm at an intuitive level. This descrip¬ 
tion highlights the need for a “well-separated” condition on the Gaussians, 
which we explain in Section 3.2. The description also highlights the need 
for “distance concentration” results for Gaussians, which are then proved in 
Section 3.3. In Section 3.5 we formally describe the algorithm and prove its 
correctness. 

3.1. Algorithm overview. The algorithm uses distance-based clustering, 
meaning that we repeatedly identify some sample point x and some dis¬ 
tance r, and all sample points in B(x,r) all put into the same cluster. Such 
distance-based clustering is not new and it appears in many heuristics, in¬ 
cluding [5, 6]. The choice of x,r is of course the crucial new element we 
provide. Since distance-based methods seem restrictive at first glance, the 
surprising part here is that we get provable results which subsume previ¬ 
ously known provable results for any algorithm. This power arises from a 
“bootstrapping” idea, whereby we learn a little about the Gaussians from 
a coarse examination of the data and then bootstrap from that information 
to find a better clustering. 

In general, distance-based clustering is most difficult when the Gaussians 
have different shapes and sizes, and overlap with each other (all of which we 
allow). It is easy to see why: a sample point from Gaussian F) might be closer 
to some sample points of another Gaussian Fj than to all the sample points 
of Fi. One crucial insight in our algorithm is that this is unlikely to happen 
if we look at the Gaussian with the smallest radius in the mixture; hence 
we should use clustering to first identify this Gaussian, and then iterate to 
find the remaining Gaussians. 

Now we give an overview of the algorithm. Let F\ be the Gaussian of 
smallest radius. Using our distance-concentration results, we can argue that 
for any x £ Si, there is an r such that (i) B(x,r) H S = S\] (ii) there is 
a “sizable” gap after r, namely, the annulus B(x,r') \B(x,r ) for some r' 
noticeably larger than r contains no samples from any Sj for j > 1; (iii) 
there is no spurious large gaps before r, which would confuse the algorithm. 
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Even after proving the above statements, the design of the algorithm is 
still unclear. The problem is to figure out the size of the gap between where 
S\ ends and S \ S\ begins, so we know when to stop. (Note: there will be 
gaps before r; the point is that they will be smaller than the one after r.) 
Our separation condition ensures that the gap between S\ and the other Sj 
is fl(*7i m a Y ); so we need an estimate of <7i jmax . We get such an estimate by 
bootstrapping. We show that if we have any fraction / of the samples in Si, 
then we may estimate ui )max to a factor of 0(1/f 2 ) with high probability. 
We use this to get a rough estimate j3 of cri jmax . Using (5, we increment the 
radius in steps which are guaranteed to be less than <7i m a x (which ensures 
that we do not step over the “gap” into another Sj) until we observe a gap; 
by then, we have provably picked up most of S\. Now we use this to better 
estimate cr\. max and then incrementing the radius by another fl(<7i jmax ), we 
capture all of Si. (The guaranteed gap ensures that we do not get any points 
from any other Gaussian while we increment the radius.) 

To make all the above ideas rigorous, we need appropriate distance- 
concentration results which assert that the distance between certain pairs 
of sample points considered as a random variable is concentrated around a 
certain value. Some distance-concentration results—at least for spherical or 
spherelike Gaussians—were known prior to our work, showing a sharp con¬ 
centration around the mean or median. However, for the current algorithm 
we also need concentration around values that are quite far from the mean 
or median. For example, to show the nonexistence of “spurious gaps,” we 
have to show that if a ball of radius r centered at a sample point x E Si 
has Ej-measure, say, exactly 1/4, then, for a small 5 > 0, the ball of radius 
r + 6 with x as center has Ej-measure at least 0.26. If such a result failed 
to hold, then we might see a “gap” (an annulus with no sample points) 
and falsely conclude that we have seen all of Si. Such concentration results 
(around values other than the median or mean) are not in general provable 
by the traditional moment-generating function approach. We introduce a 
new approach in this context: isoperimetric inequalities (see Theorem 3). 
Our method does not always prove the tightest possible concentration re¬ 
sults, but is more general. For example, one may derive weaker concentration 
results for general log-concave densities via this method (see [17]). 

3.2. Separation condition and its motivation. Now we motivate our sep¬ 
aration condition, which is motivated by the exigencies of distance-based 
clustering. Consider the very special case of spherical Gaussians E*,Ej with 
Ri = Rj. Suppose x,x',y are independent samples, x,x' picked according to 
Ej and y according to Fj. Lemma 5 will argue, that with high probability 
[we will use « here to mean that the two sides differ by at most 0(R 2 /y/n)], 

\x-pi\ 2 fuR 2 


MIXTURES OF GAUSSIANS 


9 


and similar concentration results for | x' — p % \,\y — pfi. It is an intuitive fact 
that x — Pi, x' — Pi,Pi — Pj,y — pj will all be pairwise nearly orthogonal (a 
sample from a spherical Gaussian is almost orthogonal to any fixed direction 
with high probability). So, one can show that 

Pi\ 2 + \Pi ~ x'\ 2 ~ 2Ri, 

Pi I 2 + \Pi~Pj\ 2 + \y~Pj\ 2 ~ 2 Rj + \pi-pj \ 2 - 

(The first assertion is proved rigorously in greater generality in Lemma 7 
and the second one in Lemma 8.) Thus, it is clear that if \pi —pj\ 2 is at least 
Ll(R?/y/n), then with high probability \x — x'\ and \x — y\ will lie in different 
ranges. (Aside: One can also show a sort of converse with different constants, 
since the concentration results we get are qualitatively tight. However, we 
will not establish this, since it is not needed.) This intercenter separation 
then is 

OiRi/n 1 ^). 

Our separation condition for this case is indeed this quantity, up to a factor 
log re. A weaker separation condition would be to require a separation of 
Q(Ri/y/n)] at this separation, one can still show that with high probability 
the hyperplane orthogonal to the line joining the centers at its midpoint has 
all the samples of one Gaussian on one side and the samples of the other 
Gaussian on the other side. An algorithm to learn under this condition would 
be a stronger result than our distance-based algorithm in this case. Since 
the conference version of our paper appeared, Vempala and Wang [25] have 
indeed developed a learning algorithm under this weaker separation for the 
case of spherical Gaussians using spectral techniques. 

3.3. Concentration results using isoperimetric inequalities. Suppose we 
have some probability density F in and a point x in space. For proving 
distance concentration results, we would like to measure the rate of growth 
or decline of F{B(x,r )) as a function of r. This will be provided by the 
isoperimetric inequality (see Corollary 4). 

Theorem 3 [14], Let F(x) = e~ xTA lx g(x ) be a function defined on 
where A is a positive definite matrix whose largest eigenvalue is and 

g(x) is any positive real-valued log-concave function. Suppose v is a positive 
real and we have a partition ofiR n into three sets K\,K 2 ,K^ so that, for all 
x £ Ki, y S K 2 , we have \x — y\ > v. Then 

f 2ue~ u2 Iff f 

/ F{x)dx> -—-mini / F{x)dx, / F(x)dx 

JKs v'TT Gmax \JK\ J K 2 


( 3 ) \x — x '\ 2 « \x — 

( 4 ) \x — y | 2 « \x — 
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The phrase “isoperimetric inequality” has come to mean a lower bound 
on the surface area of a set in terms of its volume. If K\ is fixed and we 
define K 3 to be the set of points not in K\ which are at distance at most v 
from some point in K\ and define K 2 = \ {K 1 U K 3 ), then as v goes to 

zero, K 3 tends to the boundary surface of K\ and the above theorem can be 
shown to yield a lower bound on the surface integral of F over this surface. 
We will make this connection rigorous below for the context we need. Such 
isoperimetric inequalities for general log-concave measures over multidimen¬ 
sional sets were first proved for use in establishing rapid convergence to the 
steady state of certain Markov chains for approximating volumes of convex 
sets and for sampling according to log-concave measures [1, 9, 16]. The proof 
of Theorem 3 uses a specialization of the above techniques to the case of 
Gaussians, where we get better results. 

Corollary 4. We borrow notation from Theorem 3 and also assume 
that F(iR n ) = 1: 

(i) If a ball B(x,r ) has F(B(x,r )) < 1/2, then 

d(HF(B(x,r)))) > 2 

dr ~ \/vro-max' 

(ii) If a ball B(x,r ) has F(B(x,r)) > 1/2, then 

dQn(l - F(B(x,r)))) ^ -2 

dr \/7r ( 7'max 

Remark. The corollary says that ln(.F(.B(:r,r))) grows at a rate of 
11(1/cTmax) until F(B(x,r)) is 1/2, and then ln(l — F(B(x,r))) declines at 
a rate of f2(l/<T max )- Intuitively, it is easy to see that this would lead to dis¬ 
tance concentration results since once we increase (decrease) r by 0 (a max ) 
from its median value, the mass outside B(x,r ) [inside B(x,r)] is small. 
The first lemma below (Lemma 5) is derived exactly on these lines; the sub¬ 
sequent three Lemmas 6-8 discuss the distances between different samples 
from the same and from different Gaussians. 


Proof of Corollary 4. Let u be an infinitesimal. Then 
d(ln(F(B(x,r)))) _ 1 d(F(B(x,r))) 

dr F(B(x,r )) dr 


= lim 


v->o vF(B{x,r)) 


[. F(B(x,r + v))-F(B(x,r ))]. 


Now we let K\ = B{x , r) and K 2 = \ B(x, r + u) and apply the theorem 

above to get the first assertion of the corollary. The second assertion follows 
similarly. 


□ 
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Lemma 5. Suppose F is a general Gaussian in iR n with maximum vari¬ 
ance in any direction a, radius R and center p. Then, for any t > 0, we 
have 

F({x: R — ta <\x — p\ < R + hr}) > 1 — e _t . 

Proof. For any 7 > 0, let F(B(p, 7 )) = g( 7 ). Then, for 7 < R, we have 
by Corollary 4 that 

dIn (g( 7 )) > 1 

dy ~ a 

Integrating from 7 to R, we get that 

F(B(p, 7 ))<ie-^- 7 ^. 

For 7 > R, isoperinretry implies that 

d(ln(l-ff( 7 ))) < -1 
d'y ~~ a 

Again integrating from R to 7 , we get 1 — g( 7 ) < . Combining 

the two, the lemma follows. □ 

Lemma 6. Let F,p, R, a be as in Lemma 5 and suppose z is any point in 
space. Let t > 1. If x is picked according to F, we have that, with probability 
at least 1 — 2 e~ f , 

(.R + ta ) 2 + \z — p \ 2 + 2 \f 2 \ft\z — p\a 
(5) > \x — z \ 2 

>((R — ta ) + ) 2 + \z — p \ 2 — 2v / 2y / t|z — p\a, 

where (R — ta) + is R — ta if this quantity is positive and 0 otherwise. 
Proof. We have 

\ x ~ z i 2 = ((* ~p) + (p- z )) ■ {{x ~ p) + ip - z)) 

= \x — p \ 2 + I p- z \ 2 + 2 (x -p)‘{p- z). 

Now 2 (x — p) • (p — z) is a normal random variable with mean 0 and variance 
at most 4| p — z\ 2 a 2 , so the probability that \2(x — p) • (jp — z)\ is greater 
than 2\f2\ft\z — p\a is at most e _t . From Lemma 5, we have that R — ta< 
\x — p\ <R + ta with probability at least 1 — e _t . Combining these two facts, 
the current lemma follows. □ 

Lemma 7. Suppose F,p,R,a as in Lemma 5. Suppose x,y are indepen¬ 
dent samples each picked according to F. Then for any t> 1, with probability 
at least 1 — 3e^, we have 

2R 2 - 8 taR <\x- y \ 2 < 2(R + 2ta) 2 . 
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Proof. We may assume that x is picked first and then y (indepen¬ 
dently). Then from Lemma 5, with probability 1 — e _t , we have R — ta < 

|a: — _p| < R-\-ta. From Lemma 6 (once x is already picked), with probability 
at least 1 — 2e _t , we have (R + ta ) 2 + \x — p\4y/ta + \x — p\ 2 > \x — y \ 2 > 

R 2 — 2 Rta — 4\x — p\Vta + \x — p\ 2 . Both conclusions hold with probability 
at least 1 — 3e~ t , whence we get 

\x — y \ 2 <(R + ta) 2 + 4ta(R + ta) + (R + ter ) 2 < 2 (R + 2 ta) 2 . 

For the lower bound on \x — y\ 2 , first note that if R < 4ta, then 2 R 2 — 8 ta < 0, 
so the lower bound is obviously valid. So we may assume that R > 4ta. Thus, 

\x — p | > 3ta and under this constraint, \x — p\ 2 — 4y/ta\x—p\ is an increasing 
function of \x — p\. So, we get 

\x — y\ 2 >R 2 — 2Rtcr — 4ta(R — ta) + (R — ta) 2 , 

which yields the lower bound claimed. □ 

Lemma 8. Let t > 1. If x is a random sample picked according to Fi and y 
is picked independently according to Fj, with Fi,Fj satisfying the separation 
condition (2), then, with probability at least 1 — 6e _t , we have 

\x - y \ 2 > 2min (R 2 ,R 2 ) + 60f(cr iimax + a jtmax )(Ri + Rj) 

{') i W 2/J , 2 \ 

' wz,max ' ^j, max/* 

Proof. Assume without loss of generality that Ri < Rj. Applying Lemma 6, 
we get that, with probability at least 1 — 2 e~ t , we have 

(8) | y Pi\ Rj 2f (Jj max Rj + | Pi Pj | 2v / 2\/t|pi Pjl&j, max' 

Claim 1. 

(9) |?/ — Pi| 2 > -Rf + 154 t(ai <max + aj <max )(Ri + Rj) F30F (a 2 max +a 2 max ). 

Proof. Case 1. R 2 > R 2 + 250t(Rj + Rj)(a, )m ax + T?,max) + 30t 2 (cr? max -|- 
^max) ' Note that bi - Pjl 2 - 4 ^CTj- maxbi - Pj| + 4tcr] max > 0. So, |Pi - 
Pj\ 2 - As/ta j)majX \pi- pj\ > -4i<r| max . Plugging this into (8), and using the 
case assumption, we get 

\V~Pi\ 2 > Ri + 250 t(Ri + Rj ) ((X^max + O'^max) 

+ 30f (<7j max + cr^max) — 2taj tmax Rj — 4taj max . 

It is easy to see that Rj > (2/3)aj jmax - this is because R 2 is clearly at least 
the median value of (u ■ x) 2 under Fj, where u is the direction achieving 
a j ma y; now it is easy see that, for the one-dimensional Gaussian u ■ x, the 
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median value of (u ■ x ) 2 is at least 2/3 times the variance by direct calcu¬ 
lation. Plugging cr? max < (3/2)Rj<jj )max into the above inequality, we easily 
get the claim. 

Case 2. Rj < 1?| + 250 t(Ri + Rj)((J hmilx + a^ max ) + 30t 2 (o- 2 max + <r 2 max ). 
Then by the separation condition, we have 

IP* — Pjl -L 250+ Rj){cTi max + CTj jmax ) + 70i"( cr i, ma x + ^j,max)• 

Now, since \pi — Pj\ 2 — 2\//2v / Lx,',max|Pi ~ Pj\ is an increasing function of 
| Pi ~ Pj | for I Pi -Pj I > 2v / 2v / t<T : , j maxj we have 

|P* Pj 2v / 2\/tcr J)max |p 7 ; pj 

— 250+ i?j)(cTj jmax + CTj^max) + 70t (f^max + ^max) 

- 2A/2v / t<Tj imax (16\/fA/^j + /?■;\/ ^j,max + ^z,max + 9 t{ 

&j ,max “1“ ^z,max)) 

— 156 t(Ri + i?j)((Jj imax + climax) + 34t (<7^ max + cr j>max ), 

using the inequality y/a + b < y/a+y/b, V a, b > 0 and observing that aj ma x < 

(3/2)Rj and <7j,max(<T/,max + &i, max) < V / 2(<7^ max "b ^j.max)' 

Putting this into (8), we get 

|y — Pi\ 2 ^ R{ — 2t(jj^m&xRj T 156t(Z2j + Rj){ ^i,max ^j,max) 

+ 34f 2 (<7 2 max + of,max)> 

which yields the claim in this case. □ 

Imagine now y already having been picked and x being picked inde¬ 
pendently of y. Applying Lemma 6, we get that, with probability at least 
1 — 2 e _t , we have (again using the inequality, y/a + b < yfa + y/b, Va, b > 0) 

!■£ V | R% 2i?jtfJj ima x + \y Pi\ 2\/2V / t(Tj im ax|y Pi| 

> I? 2 - 2Rita iyma , x + R 2 + 154f( 

^i,max ^j,max) (-^i H - Rj ) 

+ 30f 2 (cr 2 max + cr 2 max ) 

2v / 2\/f<7j jmax (.Rj + 13 v /( U.,max T ^y.rnax / R'i T -R/ 

+ \/30t((Ti imax + (T^max)) 

[because under condition (9), \y~Pi\ 2 — 2y/2y/t,cri )max \y — pi\ is an increasing 
function of \y — Pi\] 

\x - y\ 2 > 2R 2 + (154 - 2 - 2^2(1 + 13^3/2 + 1.5\/30)) 

x tipi ,max ~b tJ j,max){Ri + Rj) + 30f (°’i )ma x "b ^pmax); 
from which the lemma follows. □ 
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3.4. Warm-up: case of spherical Gaussians. As a consequence of our 
concentration results we first present our algorithm for the simple case when 
all the Fj are spherical. In this case, (j^ max ~ Ri/y/n, where the error is small 
enough that our calculations below are valid. Choosing t = n(log(|5|/5)) as 
before, it is easy to see from the distance concentration results that, with 
high probability, 


( 10 ) \x-y\ 2 e 


1R -(1 / -],2R 2 (l + 


Vx,y £ Si, \/i, 


and by appropriately choosing the constant in the definition of f-separation 
we can also ensure that with high probability there is a positive constant 
d > 12 such that 

(11) \x — y \ 2 > 2min(it! 2 , R 2 ) + ° + ^ V x £ S), V y £ Sj,\/ i / j. 

\Jn 

For each pair x,y £ S find \x — y\ and suppose xq, yo is a pair (there may be 
several) at the minimum distance. Then from (10) and (11) it follows that 
if xo £ Si, then for all y £ Si, |x — y\ < (1 + -^=)|xo — Vo\ and furthermore, 

for all z £ S \ Si, \x — z\ > (1 + 4=)|xo — yo\ • So, we may identify Si by 
S n B(x o, |xo — yo|(l + -^=))- Having thus found Si, we may peel it off from 
S and repeat the argument. The important thing here is that we can estimate 
the radius of the ball—namely, |xo — 2/o|(l + -^=)— from observed quantities; 
this will not be so easily the case for general Gaussians. 


3.5. The general case. Now we consider the case when the Gaussians 
may not be spherical or even spherelike. Let 5 > 0 be the probability of failure 
allowed. We are given a set of samples S drawn according to an unknown 

mixture of Gaussians w\F\ + W 2 F 2 H-1- w^F^-, but we are given a rc m j n > 0 

with Wi > W min for all i. We assume that |S| > 10'n 2 fc 2 log(fcn 2 )/(J 2 u; (( lir , ). 
In what follows, we choose 

100 log | S\ 

5 ' 

The Algorithm. Initialization: T <r—S. 

1. Let a > 0 be the smallest value such that a ball B(x,a) of radius a 
centered at some point in x £ T has at least 3u; m i n |5|/4 points from T. 
(This will identify a Gaussian F t with approximately the least radius.) 

2. Find the maximum variance of the set Q = B(x,a) (IT in any direction. 
That is, find 

Q = max V ( w ■ y — w ■ ( —— z ) ) . 
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(This (3 is our first estimate of <T max . Note that computing f3 is an eigen¬ 
value computation, and an approximate eigenvalue suffices.) 

3. Let v = \J Win $ n ^ ■ (We will later show that v < er max ; so increasing the 
radius in steps of v ensures that we do not miss the “gap” between the Si 
that x belongs to and the others.) Find the least positive integer s such 
that (we will later prove that such a s exists) 

B(x, a + sis) n T = B(x,a + (s — 1)^) n T. 

4. Let Q' = B(x, a + sis) IT T. As in step 3, find the maximum variance j3' 
of the set Q' in any direction. (We will prove that this /3' gives a better 
estimate of <7 max .) 

5. Remove B(x,a + sis + 3^fP r (log\S\ —log(5 + l))nT from T. (We will show 
that the set so removed is precisely one of the Si.) 

6. Repeat until T is empty. 

Remark 1. Ball B(x, a + sis) will be shown to contain all but w m i n / (lOrc*) 
of the mass of the Gaussian i 7 ) we are dealing with; the bigger radius of 
B(x,a + sis + 3\/^ 7 (log |S'! — log 8 + 1)) will be shown to include all but 
<5/(10|S'| 2 ) of the mass of F t . This will follow using isoperimetry. Then we 
may argue that with high probability all of Si is now inside this ball. An 
easier argument shows that none of the other Sj intersect this ball. 

Now we prove why this works as claimed. Let 5 > 0 be the probability of 
failure allowed. Recall that 

100 log | S| 

5 ' 

We will now show using the distance-concentration results that several 
desirable events [described below in (12)—(18)] happen, each with probabil¬ 
ity at least 1 — We will assume from now on that conditions (12)-(18) 
hold after allowing for the failure probability of at most 7<5/10. The bottom 
line is that the sample is very likely to represent the mixture accurately: 
the component Gaussians are represented essentially in proportion to their 
mixing weights; the number of samples in every sphere and half-space is 
about right and so forth. 

First, since |5j| can be viewed as the sum of |£| Bernoulli independent 
0-1 random variables, where each is 1 with probability Wi, we have [using 
standard results, e.g., Hoeffding’s inequality, which asserts that for s i.i.d. 
Bernoulli random variables X\ , X 2 ,..., X s with Prob(Aj = 1) = q, for all real 
numbers a > 0, Prob([ )T)f=i Xi — sq\ >a)< 2e _ “ 2|J / 4s ] that, with probability 
at least 1 —5/ 10, 

(12) 


l.lu;i|S , | > \Si\ > 0.9tt!j|S , | Vz. 
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For each i, 1 < i < k, and each x E Si, let rj(x) be the least positive real 
number such that 


Fi(B(x,r](x))) > 1 


5 

10|S| 2 ' 


Now, we assert that, with probability at least 1 — 

(13) Vi, 1 < i < k, Mx E Si Si C B(x,r](x)). 


To see this, focus attention first on one particular x E S, say x E Si. We may 
imagine picking x as the first sample in S and then independently picking 
the others. Then since x is fixed, y(x) and B(x,r](x)) are fixed; so from 
the lower bound on Fi(B(x,r](x)), it follows that Prob(S'j C B(x,r/(x))) > 
1 — <5/(10|S|). From this we get (13). 

We have from Lemma 7 that, with probability at least 1 — (5/10, the 
following is true for each i, 1 < i < k, \/ x, y E Sf. 

(14) 2 R 2 - 8 ta i)raax Ri < \x - y \ 2 < 2 (Ri + 2f<7 ijmax ) 2 . 

Further, from Lemma 8, we have that, with probability at least 1 — 

\/i,j<k,i^j,VxeSi,Vye Sj 

(15) \x - y | 2 >2mm(R 2 ,Rj) + 60 t{Ri + Rj){t 7j !max + <Jj, max) 

+ 30t J (cr 2 max + cr 2 max ). 

Next, we wish to assert that certain spherical annuli centered at sample 
points have roughly the right number of points. Namely, 


V*, V x,y,z E Si letting A = B(x, \x — y\) \ B(x, \x — z\) 


(16) 


we have 


\SiHA\ 


\Si 


~Fi{A) 


5/2 
< ^min 

- 160 ‘ 


We will only sketch the routine argument for this. First, for a particular 
triple x, y, z in some Si, we may assume that these points x, y, z were picked 
first and the other points of the sample are then being picked independently. 
So for the other points, the annulus is a fixed region in space. Then we may 
view the rest as Bernoulli trials and apply Hoeffding’s inequality. The above 
follows from the fact that the Hoeffding upper bound multiplied by \S\ 3 (the 
number of triples) is at most 5/10. 

Next, we wish to assert that every half-space in space contains about 
the correct number of sample points. For this, we use a standard Vapnik- 
Chervonenkis (VC) dimension argument [24]. They define a fundamental 
notion called VC dimension (which we do not define here). If a (possibly 
infinite) collection C of subsets of has VC dimension d and T> is an 
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arbitrary probability distribution on 4i n , then for any p,£> 0 and for a set 
of 


4 2 8 d , 8d 
- log - H-log — 

e p e £ 


independent identically distributed samples drawn according to V, we have 
that with probability at least 1 — p, for every H £ C, the fraction of samples 
that lie in H is between T>{H) — £ and T>(H) + e. 

In our case, C consists of half-spaces; it is well known that the VC dimen¬ 
sion of half-spaces in 4?" is n. We consider each component Fj of our mixture 
in turn as V. We have drawn a sample of size | ^ | from ly. Applying the VC 
dimension argument for each i, with p = 5/10k and e = rc m in/100, and then 
using the union bound, we conclude that, with probability at least 1 — 5/10, 
the sample satisfies 


(17) 


Vi, V half-spaces H 

||5i nH\- ISilFiiH)] < u) min |S'j|/100. 


From Lemma 12 (to come), it follows that, with probability at least 1 — 
5/10, we have 


(18) V unit length vectors w, V i 


nur J2( w '( x -p^ 2 - 2a l 

x£Si 


Lemma 9. Each execution of steps 1-5 removes precisely one of the Si. 

Proof. The lemma will be proved by induction on the number of execu¬ 
tions of the loop. Suppose we have finished l — 1 executions and are starting 
on the Zth execution. 

Let P be the set of j such that Sj has not yet been removed. (By the 
inductive assumption at the start of the loop, T is the union of Sj, j 6 P.) 

Lemma 10. Suppose x £ S is the center of the ball B(x,a) found in 
the Ith execution of step 1 of the algorithm and suppose x belongs to Si (i 
unknown to us). Then 

(19) B(x, a) C S C Si, 

13- y\“ F + 50t((7j imax + Tfmax)(-P?. T Pj) 

+ 20f 2 (cr 2 max + CT 2 max ) V y G Sj, V j / i, j G P. 

Proof. For any j S P, and all y, z G Sj, we have from (14) that \z — y | 2 < 
2 (Rj + 2t<jj ma x) 2 . Thus, a ball of radius s/2(Rj + 2cJi imax ) with y as center 
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would qualify in step 1 of the algorithm by (12). So, by definition of a in 
that step, we must have 

(21) a < V2(Rj + 2fi7j jmax ) VjGP. 

If now B(x,a) contains a point z from some Sj, j / i, by the inductive 
assumption in Lemma 9, we must have j E P. Then by (15) we have 

a 2 > 2min (R^Rj) + 60 t{Ri + Rj){a itmax + cry max ) + 30t 2 (cr 2 max + cr 2 max ), 

which contradicts (21) [noting that (21) must hold for both i,j\. This proves 
(19). 

Now, from the lower bound of (14), it follows that 

a 2 > 2 Rj - 8 Ri<J itmax t. 

So from (21) it follows that 

2 R] > 2 Ri - 8 t(Ri + Rj)(ai >max + <r jt max ) - 8t 2 <j 2 max V j E P. 

Thus from (15), we get that, for y E Sj,j / i, 

(22) \ x ~~ y\ — i ~ &t(Ri + Rj)(<Ji,max + 0y,max) — 8 1 (T- max 

+ 60 t(Ri + i?j)((Tj imax + (Tj max ) + 30t (<7j )inax + CTj max ), 

from which (20) follows. □ 


Now we can show that (3 is a rough approximation to oy max . 


Claim 2. 


The (3, Q computed in step 2 of the algorithm satisfy 


2 | Si 

\Q\ 


-CT; 


>(3> 


M 

4| 5i 


2 ^ i , max’ 


Proof. For any unit length vector w , we have, by (18), 

(x-pi)f < ( w ' ( x ~Pi)) 2 < 2 \Si\crl max . 

x&Q x&Si 


Since this holds for every w, and the second moment about the mean is less 
than or equal to the second moment about pi, we have that /3 < 2w°f max - 
This proves the upper bound on (3. 

Let u be the direction of the maximum variance of F t . We wish to assert 
that the variance of Q along u is at least |Q| 07 max /|S)|. To this end, first 
note that, for any reals 71,72, with 71 > 0 , we have 


Probi^ (72 - 71 < x ' u < 72 + 7i) 


(■72+71 


2v / vr cn, 


—r 2 /2a 2 
> ' 1,1 


dr 


max •172—71 


< 


7i 


s/nvi, 


max 
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Let 72 = E xgq( u • x ) and let 71 = m O'*, max- Then the strip H = {x: 72 - 
7 i < u ■ x < 72 + 71} satisfies Fi ( H ) < 7i/(v / ^ c7 i,max)- So, by (17), 


\ SinH \<\ Si \ Fi ( H ) + 


W min 


I Si 


100 



So, we have that 


using |Q| > fwmi n lS 1 !. 


1 

W\ 


J2( u - 

x£Q 


x - 72) 2 > 


1 I <51 IQI 2 2 1 2 IQI 2 

\ Q \ 4 \ Si\ 2(Ti ’ max 4 CT ™| 5 i | 2 ’ 


from which the lower bound on (3 obviously follows. □ 


Corollary 11. 


The (3 computed in step 2 of the algorithm satisfies 


4 


W min 


07 


max 


>P> 




max - 


Proof. Since \ Q \ > 3 u; m i n | 5 |/ 4 , Claim 2 implies the corollary. □ 


From (14) we have 

Vy^Si \x - y\ 2 < 2 Rj + 4tRia i)max + 4:t 2 al mSiX . 

From (20) we have 

y z G (J Sj \x — z\ 2 X > 2 R 2 + 50t(cr* ima x + <7j, max) (Ri + Rj ) 

j£P\{i} 

+ 20t 2 (fJ l 2 max + cr 2 max ). 

Thus, there exists an annulus of size (where the size of an annulus denotes 
the difference in radii between the outer and inner balls) oy max around x 
with no sample points in it. Since we are increasing the radius in steps of 
v which is at most oy max (Corollary 11) there is some s in step 3 of the 
algorithm. Also, we have 

B(x,a + sv) (7 S C Si. 

The trouble of course is that such a gap may exist even inside Si, so 
B(x,a + sv) may not contain all of S). To complete the induction we have 
to argue that steps 4 and 5 will succeed in identifying every point of S For 
7 > 0 , let 


5(7) = Fi ( B ( x , 7)). 


From B(x,ot + sis) flT = B(x,a + (s — \)v) n T (see step 3 of the algo¬ 
rithm), we get using (16) that 


g(a + sv) — g(a + (s — \)v) < 


w 


5/2 

min 


160 
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Since v = \J^f^ 
exists a 7 ' E [(s — 


, we get using Corollary ll’s lower bound on /3 that there 
1 )v + a sv + a) with 


f ^g(7) \ 

\ *^7 ) ■y='y' 


5/2 

W ■ 

mm 


W n 


160 v 20ov 


(If not, integration would contradict the previous inequality.) Thus isoperime¬ 
try (Corollary 4) implies that 


g(a + sv) > 1 - or g(a + (s - l)v) < O.luw. 

The latter is impossible since even the a-radius ball contains at least 3|5|rc m i n /4 
points. This implies that g(a + sv) > 0.9 and now again using (16), we see 
that \Q'\ > 0.81S'*| (note that Q' is found in step 4). Thus from Claim 2 
(noting that the proof of works for any subset of Si), we get that (3' is a 
fairly good approximation to ay m ax : 

(23) 2.5af )max > P ' > 0.16of max . 


From the definition of s in step 3 of the algorithm, it follows that there is 
some y G Si with \x — y\ > a + (s — 2 )v. So, from (14), we have a + (s — 2)v < 
\[2(Ri + 2ta i:mSLX ). So, we have 


(24) 


a + sv + 


3 y/P' ^log ■L-^- + l^j < V2 (Ri + 2tfjj imax ) + 2cij )r] 


+ 4 <7j , 


> og ^ +1 


Thus from (20), no point of Sj, j G P\ {/}, is contained in B(x, a + sv + 
(log ^ + 1)). So the set removed from T in step 5 is a subset of S t . 
Finally, using g(a + sv) > 9/10, and isoperimetry (Corollary 4), we see 
that 

s(a + + 3v^(log hi + l)) > 1 - ^ 


which implies that g(x) <a + sv + 3 y/J ?(log ^ + 1). Thus, by (13), all of Si 
is in B(x, a + sv + Sy/J ?(log ^ + 1)). This completes the inductive proof of 
correctness. □ 


Now we prove a lemma that was used above when we estimated cr max . 

Lemma 12. Suppose F is a ( general) Gaussian in 5i n . If L is a set 
of independent identically distributed samples, each distributed according 
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to F, then with probability at least 1 — we have [with e = 20n(- v /logn + 
\/log(l/<5) )/i/jL[], every vector w satisfies 

( , E f (w ■ (x - E F {x)) 2 ){\ - e) < Eg(w ■ (x - E F {x )) 2 ) 

< E F (w ■ (,x - E F (x)) 2 )(l + e), 

where Eg denotes the “sample mean”; that is, it stands for 

xSLL ' 


Proof. We may translate by —Ep{x) and without loss of generality 
assume that Ep(x ) is the origin. Suppose Q is the square root of the inverse 
of the variance-covariance matrix of F. We wish to prove, for all vectors w, 

Ep({w ■ x) 2 )(l - e) < Es({w ■ x) 2 ) < Ep{fw ■ x) 2 )(l + e). 


Putting Q 1 w = u (noting that Q is nonsingular and symmetric), this is 
equivalent to saying, for all vectors u, 

E f ((u ■ (Qx)) 2 )( 1 -e) < E s ((u- ( Qx )) 2 ) < Ep((u- (Qx)) 2 )( 1 + e). 


But Qx is a random sample drawn according to the standard normal, so 
it suffices to prove the statement for the standard normal. To prove it for 
the standard normal, we proceed as follows. First, for each coordinate i, we 
have that Ep(\xi\ 2 ) = 1 and using properties of the standard one-dimensional 
normal density, for any real s > 0, 

Prob(|£ , s(|a;j| 2 ) — 1| < s) > 1 — e~^ s2 ^. 


Now consider a pair i,j E {1,2,..., n}, where i j. The random variable 
XiXj has mean 0 and variance 1. Eg(xiXj ) is the average of N i.i.d. samples 
(each not bounded, but we may use the properties of the normal density 
again) concentrated about its mean: 

Prob^sIzjZjl < s) > 1 — e~l L l s2//100 . 


Putting s = 10 ^" , we see that all these 0(n 2 ) upper bounds hold simul¬ 
taneously with probability at least 1 — 5 /n 8 . 

Thus we have that the “moment” of inertia matrix M of S whose i,j 
entry is Eg{xiXj) has entries between 1 — \e and 1 + on its diagonal and 
the sum of the absolute values of the entries in each row is at most e/2. 
Thus by standard linear algebra (basically arguments based on the largest 
absolute value entry of any eigenvector), we have that the eigenvalues of M 
are between 1 — e and 1 + e, proving what we want. □ 
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4. Max-likelihood estimation. Now we describe an algorithm for max- 
likelihood fit of a mixture of k spherical Gaussians of equal radius to (possi¬ 
bly) unstructured data. First we derive a combinatorial characterization of 
the optimum solution in terms of the k-median ( sum of squares, Steiner ver¬ 
sion) problem. In this problem, we are given M points X\,X2, ■ ■ ■ ,xm £ 
in and an integer k. The goal is to identify k points pi,P 2 , ■ ■ ■ ,Pk that 
minimize the function 

M 

( 26 ) Yl\ x ~ Pc(j)f ’ 

2—1 

where P c (j) is the point among pi, ... ,pk that is closest to j and | • | denotes 
Euclidean distance. 

Theorem 13. The mixture of k spherical Gaussians that minimizes the 
log-likelihood of the sample is exactly the solution to the above version of 
k-median. 

Proof. Recall the density function of a spherical Gaussian of variance 
a (and radius (Ti/n) is 

1 ( \x — p\ 2 \ 

(27T C 7)V2 eXP ( v 2 cj2 )' 

Let xi,X 2 , ■ ■ ■ ,xm £ be the points. Let pi,P 2 , ■ ■ ■ ,Pk denote the centers 
of the Gaussians in the max-likelihood solution. For each data point Xj 
let Pc(j) denote the closest center. Then the mixing weights of the optimum 
mixture w\, W 2 , ■ ■ ■, Wk are determined by considering, for each i, the fraction 
of points whose closest center is p t . 

The log-likelihood expression is obtained by adding terms for the individ¬ 
ual points to obtain 

- Constant+ —log <r +V ^ ~ . 

2 . 2(t2 

l 3 

The optimum value a is obtained by differentiation, 

( 27 ) a2 = J^Y,\ x i-PcU) l 2 > 

3 

which simplifies the log-likelihood expression to 

Mn „ Mn 
Constant -|—— log o -\——. 

Thus the goal is to minimize a, which from (27) involves minimizing the 
familiar objective function from the sum-of-squares version of the /c-median 
problem. □ 
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We indicate how to use known results about the /c-median to provide 
a constant additive factor approximation to the log-likelihood. Charikar, 
Guha, Tardos and Shmoys [4] provide an 0(1) approximation to the k- 
median problem with sum-of-squares distance. They do not indicate if their 
algorithm works when the centers are not one of the sample points. However, 
the triangle inequality implies that picking the center to be one of the sample 
points changes the objective of the ^-median problem by at most a factor 
4. Hence we obtain a constant factor approximation to a 2 , and hence an 
approximation to log(<r) that is correct within an 0(1) additive factor. More 
efficient algorithms for fc-median are now known, so there is some hope 
that the observations of this section may lead to some practical learning 
algorithms. 

5. Conclusions. Several open problems remain. The first concerns solv¬ 
ing the classification problem for Gaussians with significant overlap. For 
example, consider mixtures of spherical Gaussians with pairwise intercenter 
distance only 0(max{cri, 02 }). In this case, a constant fraction of their prob¬ 
ability masses overlap, and the solution to the classification problem is not 
unique. Our algorithm does not work in this case, though a recent spectral 
technique of Vempala and Wang [25] does apply. (However, it does not apply 
to nonspherical Gaussians.) 

The second problem concerns general Gaussians whose probability masses 
do not overlap much but which appear to coalesce under random projection. 
For example, consider a pair of concentric Gaussians that have the same axis 
orientation. (Of course, these axes are unknown and are not the same as the 
coordinate axes.) In n — 2 axis directions their variance is a 2 , and in the 
other remaining two directions their variances are 1, a and a, 1, respectively. 
If a = H(logn), the difference in the last two coordinates is enough to dif¬ 
ferentiate their samples with probability 1 — l/poly(n). But after projection 
to an 0(logn)-dinrensional subspace, this difference disappears. Hence nei¬ 
ther distance-based clustering nor projection-based clustering seems able to 
distinguish their samples. 

The third open problem concerns max-likelihood estimation, which seems 
to involve combinatorial optimization with very bizarre objective criteria 
once we allow nonspherical Gaussians. 

We suspect all the above open problems may prove difficult. 

We note that Dasgupta (personal communication) has also suggested 
a variant of the classification problem in which the sample comes from a 
“noisy” Gaussian. Roughly speaking, the samples come from a mixture of 
sources, where each source is within distance e of a Gaussian. We can solve 
this problem in some cases for small values of e, but that will be the subject 
of another paper. Broadly speaking, the problem is still open. 
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